ABSTRACT

This paper presents Dialogos, a real-time system for
human-machine spoken dialogue over the telephone in
task-oriented domains. The system has been tested in
a large trial with inexperienced users and has proved
robust enough to allow spontaneous interaction both
for users who obtain good recognition performance and
for those who obtain lower scores. This robustness
has been achieved by combining specific language
models during the recognition phase, tolerance of
spontaneous speech phenomena, a robust parser, and
pragmatic-based dialogue knowledge. Integrating the
modules in this way makes it possible to deal with
partial or total breakdowns at the different levels of
analysis. We report the field trial data and the
evaluation results for the overall system and for its
submodules.

1. INTRODUCTION 

During the past few years the recognition of spontaneous
speech in telephone dialogues has greatly improved.
Nevertheless, natural spoken dialogue between computers
and inexperienced users still presents some problematic
issues, such as the real-time management of large
vocabularies, robustness to different pronunciations
of a given natural language, and the ability to handle
miscommunication within cooperative human-machine
dialogues. Before delivering telephone-based spoken
language applications to the general public, we have to
define effective methodologies for overcoming these
problems.

We present a telephone spoken dialogue system, Dialogos,
that has been designed and implemented on the principle
of strict integration among the different levels of
analysis of the user's utterances. This means that all
the system modules are able to deal with partial or
total breakdowns of the other modules.

Dialogos is a real-time system that understands spoken
Italian in the domain of railway timetable inquiry.
It works on the public telephone network and does
not require any training to be used by inexperienced
users. Its dictionary contains 3,471 words, including
2,983 proper names of the Italian railway stations.

The system is composed of a set of modules: the
acoustic front-end, the acoustic processor, the
linguistic processor, the dialogue manager and the
text-to-speech synthesizer, which is the ELOQUENS
commercial system by CSELT. A telephone interface
connects the acoustic front-end and the synthesizer to
the public telephone network, while the dialogue
manager is connected to the railway timetable database.
The telephone interface and the synthesizer are housed
on a 486 PC equipped with Dialogic D41E boards. The
railway timetable is on a Pentium PC, and the rest of
the system is software only and runs on a DEC Alpha
2100.

2. ACOUSTIC PROCESSING 

The telephone signal, which has a bandwidth of
300-3400 Hz, is sampled at 8 kHz. Pre-processing
consists of a mel-based spectral analysis followed by a
Discrete Cosine Transform, yielding a vector of 12
cepstral coefficients every 10 ms. In addition, the
logarithm of the total energy is retained, as it helps
to distinguish the voiced parts of the speech from the
unvoiced ones. First- and second-order derivatives of
the log energy and of the 12 cepstral coefficients are
also calculated, resulting in a frame of 39 parameters.
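
The frame layout described above (13 static values plus their first and
second derivatives) can be sketched as follows. This is only an
illustration: the function names and the central-difference derivative
with edge replication are assumptions, since the text does not say how
the derivatives are computed.

```python
def central_diff(seq):
    """Per-frame central difference of a list of equal-length vectors.

    Edge frames reuse their nearest neighbour (an assumption of this sketch).
    """
    n = len(seq)
    out = []
    for t in range(n):
        prev = seq[max(t - 1, 0)]
        nxt = seq[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_derivatives(static_frames):
    """Append first- and second-order derivatives to each static frame.

    static_frames: list of 13-value lists (12 cepstral coefficients plus
    log energy), one per 10 ms frame. Returns a list of 39-value lists.
    """
    delta = central_diff(static_frames)
    delta2 = central_diff(delta)
    return [s + d + dd for s, d, dd in zip(static_frames, delta, delta2)]


# Toy input (not real speech): 5 frames whose 13 values all equal the frame index.
frames = [[float(t)] * 13 for t in range(5)]
features = add_derivatives(frames)
print(len(features), len(features[0]))
```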

The acoustic modeling is based on a hybrid HMM-NN
(Hidden Markov Model-Neural Network) model [?]
of the same class as that described in [?]. Each word
is described in terms of a left-to-right automaton (with
self loops), obtained by concatenating elementary
acoustic units. The posterior probabilities P(Q|X) of
the automaton states are estimated by a Multi-layer
Perceptron (MLP) neural network. The training of the
acoustic model simultaneously finds the best
segmentation of words into phonemes and of phonemes into
states, and trains the network to discriminate between
these states.

Recently, Fissore et al. [?] introduced a new set
of units, called Stationary-Transitional Units (STU),
which have been adopted instead of phonemes. These
units are made up of the stationary parts of the
context-independent phonemes plus all the admissible
transitions between them, for a total of 391 units.
This set of STUs is language dependent but domain
independent, and represents a partition of the sounds of
the language, like phonemes, but with more acoustic
detail. The MLP used has one input layer that looks at 7
frames and two hidden layers. The output layer, fully
connected, contains one unit for each STU. The total
number of weights is 195,000.

The telephone-quality speech used to train the HMM-NN
has the following features:

• read speech, domain independent, 1,136 speakers,
about 8,000 utterances;

• spontaneous speech, domain dependent, about
3,580 utterances.

The recognition algorithm is based on frame-synchronous
Viterbi decoding. It can work in either isolated or
continuous recognition mode and can be applied to
different sets of words (vocabularies) to meet the
requirements of the dialogue manager.
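
As an illustration of Viterbi alignment over a left-to-right automaton
with self loops, a minimal sketch follows. It is not the decoder used in
the system: it scores a single word model, ignores transition
probabilities, and assumes at least as many frames as states. The
function name and the toy scores are hypothetical.

```python
import math


def viterbi_left_to_right(log_post):
    """Best state alignment for a left-to-right automaton with self loops.

    log_post[t][s] is the log score of state s at frame t (for example a
    log posterior produced by the network). The path starts in state 0,
    ends in the last state, and at each frame either stays or advances
    by one state. Returns (best_score, state_path).
    """
    n_frames = len(log_post)
    n_states = len(log_post[0])
    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_post[0][0]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best_prev, prev_state = move, s - 1
            else:
                best_prev, prev_state = stay, s
            if best_prev > neg_inf:
                score[t][s] = best_prev + log_post[t][s]
                back[t][s] = prev_state
    # Trace back from the final state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path


# Toy scores for 4 frames and 2 states (made-up numbers).
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
log_post = [[math.log(p) for p in frame] for frame in posteriors]
best_score, path = viterbi_left_to_right(log_post)
print(path)  # [0, 0, 1, 1]
```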

3. LANGUAGE MODELING 

The language model (LM) is a class-based bigram model.
There are 358 classes; 348 of them contain a single
word, while the remaining 10 classes contain
semantically important words, such as city names (2,983
words), station names (33 words), numbers (76 words),
months, week days, and so on.

The bigram model was trained on a set of 30,000
sentences composed of two parts: written material
(86%) and sentences acquired during a past trial
(14%). Currently the bigrams are smoothed using
a linear interpolation algorithm, because the training
set was too small to support other kinds of
smoothing [?].
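
A class-based bigram with linear interpolation can be sketched as
follows. This is an illustration under stated assumptions, not the
system's model: the factorisation P(w | w') = P(w | c) P(c | c'), the
fixed interpolation weight `lam`, and the toy Italian sentences are all
assumptions made for the example.

```python
from collections import Counter


def class_of(word, word_class):
    """Words missing from the map form their own single-word class."""
    return word_class.get(word, word)


def train_class_bigram(sentences, word_class):
    """Collect the counts needed by a class-based bigram model."""
    class_uni, class_bi = Counter(), Counter()
    history, word_in_class = Counter(), Counter()
    for words in sentences:
        classes = [class_of(w, word_class) for w in words]
        class_uni.update(classes)
        class_bi.update(zip(classes, classes[1:]))
        history.update(classes[:-1])          # classes seen as a left context
        word_in_class.update(zip(words, classes))
    return class_uni, class_bi, history, word_in_class


def bigram_prob(prev, word, counts, word_class, lam=0.7):
    """P(word | prev) = P(word | class) * P_interp(class | previous class)."""
    class_uni, class_bi, history, word_in_class = counts
    c_prev, c = class_of(prev, word_class), class_of(word, word_class)
    p_uni = class_uni[c] / sum(class_uni.values())
    p_bi = class_bi[(c_prev, c)] / history[c_prev] if history[c_prev] else 0.0
    p_class = lam * p_bi + (1.0 - lam) * p_uni   # linear interpolation
    p_word = word_in_class[(word, c)] / class_uni[c] if class_uni[c] else 0.0
    return p_word * p_class


# Toy data (made up): three city names share one class.
word_class = {"torino": "CITY", "milano": "CITY", "roma": "CITY"}
sentences = [["da", "torino", "a", "milano"], ["da", "roma", "a", "torino"]]
counts = train_class_bigram(sentences, word_class)
seen = bigram_prob("da", "milano", counts, word_class)
unseen = bigram_prob("da", "a", counts, word_class)
print(seen, unseen)
```

The unseen pair ("da", "a") still receives a non-zero probability,
because the interpolation backs off to the class unigram estimate.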

Recently the use of dialogue-dependent prediction
LMs has been integrated into the Dialogos system;
see [?]. These models are trained on a
dialogue-dependent partition of a corpus acquired from a
dialogue system, according to the dialogue point at
which each utterance was given. Our work is related to
the static predictions of [?] and to the dialogue
step-dependent models of [?]. On a test set of 2,040
utterances, the use of dialogue-dependent predictions
reduces the error rate of word accuracy (WA) by 8.6%
and of sentence understanding (SU) by 10.9%.
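
The idea of partitioning a corpus by dialogue point and selecting a
model accordingly can be sketched as follows. This is a simplified
illustration: plain word counts stand in for the bigram LMs, and the
dialogue point labels and the fallback to a general model are
assumptions of the example.

```python
from collections import Counter, defaultdict


def partition_by_dialogue_point(corpus):
    """Group utterances by the dialogue point at which they were given.

    corpus: list of (dialogue_point, list_of_words) pairs.
    """
    parts = defaultdict(list)
    for point, words in corpus:
        parts[point].append(words)
    return dict(parts)


def train_models(corpus):
    """One word-count table per dialogue point, plus a general one."""
    models = {
        point: Counter(w for utt in utts for w in utt)
        for point, utts in partition_by_dialogue_point(corpus).items()
    }
    models["general"] = Counter(w for _, utt in corpus for w in utt)
    return models


def select_model(models, dialogue_point):
    """Use the dialogue-dependent model if one exists, else the general one."""
    return models.get(dialogue_point, models["general"])


# Toy corpus with hypothetical dialogue point labels.
corpus = [
    ("ask_departure", ["da", "torino"]),
    ("ask_date", ["domani"]),
    ("ask_departure", ["torino"]),
]
models = train_models(corpus)
print(select_model(models, "ask_departure")["torino"])  # 2
```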

4. LINGUISTIC PROCESSING 

The linguistic processor starts from the best-decoded
sequence; it performs a multi-step robust partial
parsing and, at the end of the analysis, constructs the
deep semantic representation of the user utterance in
the form of a case frame and sends it to the dialogue
module. The parser is designed to achieve robust
performance; it is an evolution of the parser described
in [?], designed to allow a faster definition of the
linguistic knowledge to be used in application domains
in the field of information inquiry. Only the
grammatical structures that contribute to discriminating
between the different domain concepts conveyed by
a given lexical item need to be defined and used.

Parsing is performed in three steps: a step of local
grammatical analysis and two steps of semantic
analysis. The grammatical analysis assigns to each
lexical item a set of nonterminals, that is, the union
of the paths that, in each syntactic tree, connect that
lexical item to the root. Notice that these trees do not
necessarily cover the whole utterance: they are only the
largest grammatical structures that include the given
word. In addition, the trees pertaining to a lexical
item do not necessarily cover the same utterance
segment. To achieve robustness, local grammatical
analysis is performed iteratively, starting from each
word of the utterance and generating all the local
grammatical structures that cover the utterance segments
starting with that word and that are as long as
possible.
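
The iteration over start positions can be illustrated with a toy
sketch. It replaces the grammar with flat token patterns, so it shows
only the "start at each word, keep the longest covering structure"
control flow, not real context-free parsing. The pattern table, the
labels and the city list are hypothetical.

```python
# Hypothetical flat patterns standing in for grammar rules.
PATTERNS = {
    "DEPARTURE": [("from", "CITY")],
    "ARRIVAL": [("to", "CITY")],
    "CITY_NP": [("CITY",)],
}
CITIES = {"torino", "milano", "roma"}


def token_class(word):
    """Map a word to the symbol used in the patterns."""
    return "CITY" if word in CITIES else word


def local_structures(words):
    """For each start position, keep the longest matching structures.

    Returns a list of (start, end, labels) with end exclusive. Positions
    where nothing matches are skipped, so unparsable words do not block
    the analysis of the rest of the utterance.
    """
    classes = [token_class(w) for w in words]
    result = []
    for start in range(len(classes)):
        best_len, best_labels = 0, []
        for label, sequences in PATTERNS.items():
            for seq in sequences:
                n = len(seq)
                if tuple(classes[start:start + n]) == seq:
                    if n > best_len:
                        best_len, best_labels = n, [label]
                    elif n == best_len:
                        best_labels.append(label)
        if best_labels:
            result.append((start, start + best_len, best_labels))
    return result


structures = local_structures("uh from torino to milano".split())
print(structures)
```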

The grammar used to perform local grammatical 
analysis is written using a context-free like formalism; 
it is a 'semantic grammar' in the sense that the non- 
terminal names have to be defined considering not only 
syntactic knowledge but also a certain amount of se- 
mantic knowledge useful for the subsequent steps of 
semantic analysis. 

The first step of the semantic analysis is completely 
local; it collects a set of application concepts, each one 
characterized by a score that represents the degree of 
linguistic reliability. The second step solves conflicts 
amongst these concepts and selects a set of mutually 
compatible application concepts. 
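
One simple way to select mutually compatible concepts from scored
candidates is a greedy pass in decreasing score order, sketched below.
The text does not describe the actual conflict-resolution procedure, so
both the greedy strategy and the definition of a conflict (overlapping
word spans) are assumptions of this example.

```python
def select_compatible(concepts):
    """Greedy choice of mutually compatible concepts, best score first.

    concepts: list of (name, start, end, score) with end exclusive.
    Two concepts are treated as conflicting when their word spans overlap.
    Returns the chosen concepts ordered by start position.
    """
    chosen = []
    for cand in sorted(concepts, key=lambda c: c[3], reverse=True):
        _, start, end, _ = cand
        if all(end <= s or start >= e for _, s, e, _ in chosen):
            chosen.append(cand)
    return sorted(chosen, key=lambda c: c[1])


# Made-up candidates and scores.
candidates = [
    ("DEPARTURE", 1, 3, 0.9),
    ("CITY", 2, 3, 0.4),
    ("ARRIVAL", 3, 5, 0.8),
    ("CITY", 4, 5, 0.4),
]
selected = select_compatible(candidates)
print([name for name, _, _, _ in selected])  # ['DEPARTURE', 'ARRIVAL']
```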

5. DIALOGUE MANAGEMENT 

The dialogue module (DM) has been designed to cope
with task-oriented spoken language applications: that
is, the DM performs its communicative actions to
achieve the goal of collecting the parameters for
accessing the database. At each turn of interaction with
the user, the DM interprets the user's utterance on the
basis of the dialogue history and of the contextual
knowledge, and it selects a dialogue act that lets it
address the user with a contextually appropriate
message.

At each step of the human-machine interaction, the
contextual knowledge of the DM is expressed in terms
of pragmatic-based expectations about what the user
is likely to say in her/his next utterance. Any
discrepancies between the expectations of the system
and the user's actual behavior are interpreted as
symptoms of a breakdown in some previous step of the
ongoing interaction [?]. When that happens, the system
is able to continue the user-initiated repair.
Moreover, the DM itself is able to initiate recovery
from errors of other subcomponents, both in case of
total non-understanding and in case of partial
inconsistencies.

Details of the implementation of the dialogue module
are given in [?]. Briefly, the dialogue strategy of
the DM assumes that both the user and the system
cooperate in achieving the goal of their linguistic
interchange. In our application domain this means that
the user's goal and the system's goal converge on the
identification of the parameters needed to access the
database, i.e. the departure and arrival cities, the
date and the time of travel. The DM prompts the
user to provide these parameters in an ordered fashion.
However, the DM is also able to deal with parameters
that are relevant to the task and are spontaneously
offered by the user.
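
The strategy of prompting for parameters in order while accepting
parameters the user volunteers can be sketched as a small slot-filling
loop. This is an illustration only: the parameter names follow the
text, but the case-frame representation as a dictionary and the
function names are assumptions.

```python
PARAMETERS = ["departure", "arrival", "date", "time"]  # prompting order


def update(state, case_frame):
    """Merge every task-relevant parameter found in the user's utterance,
    whether or not it was the one the system asked for."""
    new_state = dict(state)
    for name, value in case_frame.items():
        if name in PARAMETERS:
            new_state[name] = value
    return new_state


def next_prompt(state):
    """Return the first parameter still missing, or None when all the
    parameters needed to query the database have been collected."""
    for name in PARAMETERS:
        if name not in state:
            return name
    return None


state = {}
first = next_prompt(state)                       # system asks for departure
# The user answers with more than was asked (hypothetical case frame).
state = update(state, {"departure": "Torino", "arrival": "Milano",
                       "date": "tomorrow"})
second = next_prompt(state)                      # only the time is missing
print(first, second)  # departure time
```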

The DM interacts with the speech recognizer and
with the database server. The interaction with the
recognizer is implemented by passing to it the
expectations of the DM in the form of predictions of
classes of words and phrases. Moreover, when recognition
failures occur repeatedly, the DM may require that some
crucial parameters be acquired in isolated speech
recognition mode.
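
The switch to isolated recognition after repeated failures can be
sketched as follows. The threshold of two consecutive failures and the
function name are hypothetical; the text does not state how many
failures trigger the switch.

```python
def recognition_mode(failure_history, max_failures=2):
    """Pick the recognition mode for the next turn.

    failure_history: list of booleans, True for a failed recognition,
    most recent last. max_failures is a hypothetical threshold on the
    number of consecutive failures.
    """
    consecutive = 0
    for failed in reversed(failure_history):
        if not failed:
            break
        consecutive += 1
    return "isolated" if consecutive >= max_failures else "continuous"


print(recognition_mode([False, True, True]))  # isolated
print(recognition_mode([True, False, True]))  # continuous
```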

The interaction with the database is bi-directional:
on the one hand, the DM sends queries to the database
as soon as the parameters involved have been
acquired; on the other hand, it makes use of
application-dependent information to tailor the dialogue
strategy to the kind of information actually
needed to access the database.

There is an increasing awareness that spoken language
systems may greatly benefit from robust dialogue
management [?]. In previous work [?], we
identified two metrics (explicit and implicit
recovery) that may be used to evaluate the robustness
of the system by measuring the DM's ability
to recover from miscommunications. In experiments with
a previous version of the system with semi-naive and
naive users, we found that the DM increased the
contextual appropriateness of the system answers by 17%.

6. FIELD TRIAL EVALUATIONS 

An extensive field trial was carried out with 493 Italian
subjects. Subjects were recruited from all over Italy;
their distribution by regional origin matched that of
Telecom Italia users. The subjects selected were roughly
half male and half female, in the age range from 18 to
over 65, and with different levels of education.

Each subject had to make three telephone calls: in
each one she/he had to plan a trip from a given city
to another one. In the first call the subjects followed
a pre-defined scenario that specified the departure and
the arrival cities, while in the third call they were free
to choose both the departure and the arrival point; in
each of the three calls they were free to decide the
date and the time of departure.

The collected corpus consists of 1,363 dialogues for
a total of 13,123 utterances. All the calls were made
over the public telephone network but in three
different environments: home (80.3% of calls), telephone
box (9.9% of calls) and some very noisy environments
such as streets, cars, stations, and the underground
(9.7% of calls). Four different kinds of telephone were
used: DTMF phones, used both at home and in telephone
boxes (76.3% of calls), dial phones (8.1% of calls),
cordless phones (5.9% of calls), and mobile phones (9.7%
of calls). The mobile phones were always used in a noisy
environment.

All the speech material acquired, 18 hours of speech
(487 Mbytes of data), was manually transcribed and
evaluated.

The dialogues have been evaluated both from the
point of view of the overall system and from the point of
view of the recognition and linguistic processing
modules. With regard to the system's overall performance,
we classify each dialogue of the corpus into one of the
following classes:

• SUCCESS (S): completely successful dialogues: all
the user parameters (departure, arrival, date, and
time) have been correctly acquired and those
parameters were used to access the database.

• SUCCESS with CONSTRAINT RELAXATION
(SC): successful dialogues where one parameter
(date or time) was not recognized and the database
was accessed with a default value: tomorrow for the
date, and the main train connections of the day for
the time.

• SYSTEM FAILURE (SF): dialogues that failed
due to various kinds of system inadequacies.

• USER FAILURE (UF): dialogues that failed due
to non-cooperative user behavior.
Figure 1: Summary of Transaction Success

Figure 1 shows the summary of transaction success:
if we put together the S and SC dialogues we obtain
71.7% successful dialogues. If we exclude from the
corpus the dialogues that failed because of user
mistakes, we obtain the upper bound of the measure of
transaction success, i.e. 84.4%.
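
The two measures above can be computed from per-dialogue labels as
sketched below. The counts in the example are made up for illustration
and are not the trial data; the function name is hypothetical.

```python
from collections import Counter


def transaction_success(labels):
    """labels: one of "S", "SC", "SF", "UF" per dialogue.

    Returns (success_rate, upper_bound) as percentages. The upper bound
    is recomputed after dropping the user-failure (UF) dialogues.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    successes = counts["S"] + counts["SC"]
    without_uf = total - counts["UF"]
    return 100.0 * successes / total, 100.0 * successes / without_uf


# Made-up counts for 100 dialogues, NOT the figures of the field trial.
labels = ["S"] * 60 + ["SC"] * 12 + ["SF"] * 13 + ["UF"] * 15
rate, bound = transaction_success(labels)
print(round(rate, 1), round(bound, 1))  # 72.0 84.7
```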

Analysing the three different scenarios, we can
observe that users are able to adapt their speaking
styles in order to be better understood by the system:
they probably learn to speak after the tone. Both user
and system errors decrease from the first dialogue
to the second: SF falls from 12.5% to 10.3% and UF from
19.1% to 14.3%. In the third dialogue users continue to
learn (their errors decrease to 12.0%), but the system
failures increase to 16.8%, partly because the users
asked for connections between cities that were not
present in the database.

We have also taken into account the different
environments and telephone types used in the trial. The
DTMF telephone obtains the best results (S 85.5%), the
dial phone obtains the worst results (S 77.1%), and the
mobile phone, even though used in very noisy
environments, obtains good results (S 80.0%).

The average duration of the S dialogues is close to
2 minutes. That time includes the reading of the
retrieved railway information, which depends mostly on
the selected cities; 60% of the S dialogues obtained
the parameters to access the database in less than one
minute.

We evaluated the 13,123 sentences of the corpus from the
point of view of recognition (word accuracy, WA)
and understanding (sentence understanding, SU)
performance; we obtained a WA of 61% and an SU of 76%.
It is important to observe that 19% of the utterances
are affected by various kinds of spontaneous speech
phenomena. In order of importance they are: shouts (4.7%
of sentences), restarts (5.1% of sentences),
extralinguistic phenomena (6.5% of sentences),
ill-formed sentences (2.7%) and out-of-dictionary words
(5.7% of sentences).

When these sentences are excluded, WA and SU improve
to 77.4% and 83.6% respectively.
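
The text does not spell out how WA is computed. A common definition,
assumed here for illustration, is one minus the word-level edit
distance (substitutions, deletions and insertions) divided by the
number of reference words. A minimal sketch for a single non-empty
reference transcription:

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - (substitutions + deletions + insertions) / N,
    where N is the number of reference words (assumed non-zero)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return 1.0 - prev[-1] / len(ref)


# Made-up transcription pair: one substitution and one deletion in 4 words.
print(word_accuracy("da torino a milano", "da x a"))  # 0.5
```

With this definition the value can become negative when the hypothesis
contains many insertions.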

7. CONCLUSIONS 

The major advantage of Dialogos is that it offers
a good level of efficiency to users who obtain good
recognition performance, while relying on several
recovery actions to allow most users with poor
recognition performance to complete their interactions
successfully.

The experimental results show that most of the 
users were able to give and confirm all the required 
parameters, and that the system acquired those pa- 
rameters with acceptable efficiency: 60% of the users 
did that in less than one minute and 70% in less than 
seven dialogue turns. 

On the basis of the experimental data we can
observe that co-operative behavior by the user is
essential: if we eliminate the non-co-operative
dialogues from the corpus, the rate of successful
dialogues increases from 71.7% to 84.5%. This datum
suggests that, in order to obtain realistic evaluations
of the performance of spoken language systems,
experimentation should move from the execution of
realistic scenarios to the use of such systems by real
users.